Int8 Quantization Support for DiT (Z-Image & Qwen-Image) by yjb767868009 · Pull Request #1470 · vllm-project/vllm-omni

yjb767868009 · 2026-02-25T07:44:00Z

Overview

This PR introduces online int8 quantization for DiT models in vLLM-Omni. The set of supported models follows the FP8 quantization work, starting with Z-Image and Qwen-Image (text-to-image). Online int8 converts BF16/FP16 weights to int8 at model load time and scales activations dynamically at run time.

Device compatibility: W8A8
Per-layer control: ignored_layers lets users keep sensitive layers in BF16
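
To make the W8A8 idea concrete, here is a minimal pure-Python sketch of online weight conversion plus dynamic activation scaling. It is an illustration under stated assumptions, not the kernel used in this PR: the symmetric per-row scales, the helper names `quantize_rows` and `w8a8_linear`, and the toy matrices are all hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_rows(matrix: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric int8 quantization with one scale per row (assumed granularity)."""
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_linear(x: Matrix, w_q: List[List[int]], w_scales: List[float]) -> Matrix:
    """Integer matmul with activation scales computed at call time (dynamic)."""
    x_q, x_scales = quantize_rows(x)
    out = []
    for x_row, x_scale in zip(x_q, x_scales):
        out.append([
            # Accumulate in integers, then rescale back to floating point.
            sum(a * b for a, b in zip(x_row, w_row)) * x_scale * w_scale
            for w_row, w_scale in zip(w_q, w_scales)
        ])
    return out


# "Load time": convert floating-point weights (shape [out, in]) to int8 once.
weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
w_q, w_scales = quantize_rows(weight)

# "Run time": activations are quantized on the fly for every call.
x = [[1.0, 2.0, -3.0]]
y_int8 = w8a8_linear(x, w_q, w_scales)
y_ref = [[sum(a * b for a, b in zip(x_row, w_row)) for w_row in weight] for x_row in x]
print(y_int8, y_ref)
```

The quantized result tracks the floating-point reference closely on this toy input; on real models the gap depends on the weight and activation distributions, which is why a per-layer skip list is useful.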

Supported Models

Model	HF Models	Recommendation	`ignored_layers`
Z-Image	`Tongyi-MAI/Z-Image-Turbo`	All layers	None
Qwen-Image	`Qwen/Qwen-Image`, `Qwen/Qwen-Image-2512`	All layers	None

Changes

Quantization framework (vllm_omni/diffusion/quantization/)

int8.py — DiffusionInt8Config with Int8Config for device (dynamic activation scaling, online weight conversion)
__init__.py — DiffusionInt8Config is added and registered to the supported quantization methods

Tests (tests/diffusion/quantization/)

test_int8_config.py — Unit tests covering config creation, vLLM config extraction, ignored_layers, dict non-mutation, conflicting method warnings, and end-to-end integration with OmniDiffusionConfig

Documentation (docs/)

user_guide/diffusion/quantization/overview.md — Quantization methods overview
user_guide/diffusion/quantization/int8.md — Usage guide (Python API + CLI), parameter reference, per-model recommendations
user_guide/diffusion_acceleration.md — Updated model support table with int8 column
.nav.yml — Added int8 quantization section to docs navigation

Example (examples/offline_inference/text_to_image/text_to_image.py)

Added --quantization int8

How to Use

from vllm_omni import Omni

# Z-Image: all layers quantized
omni = Omni(model="Tongyi-MAI/Z-Image-Turbo", quantization="int8")

# Qwen-Image: skip sensitive img_mlp layers
omni = Omni(
    model="Qwen/Qwen-Image-2512",
    quantization_config={
        "method": "int8",
        "ignored_layers": ["img_mlp"],
    },
)

# CLI
python text_to_image.py --model Tongyi-MAI/Z-Image-Turbo --quantization int8

python text_to_image.py --model Qwen/Qwen-Image-2512 --quantization int8 \
    --ignored-layers "img_mlp"
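
To illustrate what a per-layer skip list does, here is a minimal sketch. It assumes each `ignored_layers` entry is compared against the dot-separated components of a module name; the helper `should_quantize` and the layer names are hypothetical, and the actual matching logic in vLLM-Omni may differ.

```python
from typing import List, Optional


def should_quantize(layer_name: str, ignored_layers: Optional[List[str]]) -> bool:
    """Return False when any component of the layer name is in the skip list."""
    if not ignored_layers:
        return True
    parts = layer_name.split(".")
    return not any(ignored in parts for ignored in ignored_layers)


# Hypothetical module names, for illustration only.
layer_names = [
    "transformer_blocks.0.attn.to_q",
    "transformer_blocks.0.img_mlp.net.0",
    "transformer_blocks.0.txt_mlp.net.0",
]

for name in layer_names:
    mode = "int8" if should_quantize(name, ["img_mlp"]) else "bf16 (skipped)"
    print(f"{name}: {mode}")
```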

Test Plan

Lint/type check passes with no import or type errors
python examples/offline_inference/text_to_video/text_to_video.py --quantization int8 --model <wan_model> initializes without errors
Running without --quantization produces identical behavior (quant_config=None flows through as no-op)

Test Result

Quantization Quality Benchmark for GPU

Qwen-Image-2512

Config	Avg Time	Speedup	Memory (GiB)	Mem Reduction	Mean LPIPS
BF16 baseline	30.07s	—	65.18	—	(ref)
int8	20.45s	32%	47.64	20%	0.0197
int8 skip img_mlp	26.78s	11%	51.43	14%	0.0027

Z-Image-Turbo

Config	Avg Time	Speedup	Memory (GiB)	Mem Reduction	Mean LPIPS
BF16 baseline	22.69s	—	24.95	—	(ref)
int8	14.95s	34%	20.16	19%	0.1597
int8 skip feed_forward	20.32s	10%	22.80	9%	0.0290

Quantization Quality Benchmark for Atlas A2

Qwen-Image-2512

Config	Avg Time	Speedup	Memory (GiB)	Mem Reduction	Mean LPIPS
BF16 baseline	22.88s	—	59.93	—	(ref)
int8	21.18s	7%	47.75	20%	0.0312
int8 skip img_mlp	22.26s	3%	51.81	14%	0.0068

Z-Image-Turbo

Config	Avg Time	Speedup	Memory (GiB)	Mem Reduction	Mean LPIPS
BF16 baseline	16.95s	—	23.43	—	(ref)
int8	14.78s	13%	18.03	23%	0.1474
int8 skip feed_forward	16.24s	4%	21.63	8%	0.0337

Memory Profiling

Qwen-Image-2512, 1024x1024, 50 steps

Config	Weights	Activations	Peak	Total Reduction
BF16, TP=1	55.78 GB	9.4 GB	65.18 GB	-
Int8, TP=1	44.96 GB	9.76 GB	54.72 GB	16%
BF16, TP=2	43.13 GB	10.32 GB	53.45 GB	-
Int8, TP=2	37.17 GB	10.52 GB	47.69 GB	11%

Z-Image-Turbo, 1024x1024, 50 steps

Config	Weights	Activations	Peak	Total Reduction
BF16, TP=1	20.51 GB	4.44 GB	24.95 GB	-
Int8, TP=1	14.87 GB	5.29 GB	20.16 GB	19%
BF16, TP=2	14.83 GB	5.30 GB	20.13 GB	-
Int8, TP=2	11.84 GB	5.28 GB	17.12 GB	15%
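
The Memory Profiling columns can be cross-checked with a few lines of arithmetic. The sketch below assumes Peak = Weights + Activations and Total Reduction = 1 - int8 peak / BF16 peak, rounded to a whole percent; the helper name `reduction_pct` is hypothetical, and the numbers are copied from the tables above.

```python
def reduction_pct(baseline: float, value: float) -> int:
    """Relative reduction versus the baseline, as a rounded percentage."""
    return round((1.0 - value / baseline) * 100)


# (weights GB, activations GB) per configuration, taken from the tables above.
profiles = {
    "Qwen-Image-2512, TP=1": {"bf16": (55.78, 9.4), "int8": (44.96, 9.76)},
    "Qwen-Image-2512, TP=2": {"bf16": (43.13, 10.32), "int8": (37.17, 10.52)},
    "Z-Image-Turbo, TP=1": {"bf16": (20.51, 4.44), "int8": (14.87, 5.29)},
    "Z-Image-Turbo, TP=2": {"bf16": (14.83, 5.30), "int8": (11.84, 5.28)},
}

for name, cfg in profiles.items():
    bf16_peak = sum(cfg["bf16"])  # weights + activations
    int8_peak = sum(cfg["int8"])
    print(f"{name}: peak {bf16_peak:.2f} -> {int8_peak:.2f} GB, "
          f"reduction {reduction_pct(bf16_peak, int8_peak)}%")
```

The same formula applied to the Avg Time columns gives the Speedup values in the benchmark tables, for example 22.69s to 14.95s for Z-Image-Turbo on GPU.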

Qwen-Image

[Figure: BF16 vs. int8 on GPU]

[Figure: BF16 vs. int8 on NPU]

Z-Image

[Figure: BF16 vs. int8 on GPU]

[Figure: BF16 vs. int8 on NPU]

Related Issues

RFC: vllm-omni supports int8 quantization on the NPU (#1245)
RFC: FP8 Quantization Support for Diffusion Transformers (#935)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 54149deb01

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

lishunyang12 · 2026-02-25T09:26:10Z

May I ask how you measured the memory usage? I didn't take the KV cache into account and only recorded the weights of the loaded model. There seem to be some discrepancies between the two.

yjb767868009 · 2026-02-25T09:40:09Z

May I ask how you measured the memory usage? I didn't take the KV cache into account and only recorded the weights of the loaded model. There seem to be some discrepancies between the two.

I observed the peak VRAM at runtime using npu-smi info, which is similar to nvidia-smi.

lishunyang12

left a couple comments, mostly around the top-level torch_npu import

lishunyang12 · 2026-02-25T09:41:35Z

+from typing import TYPE_CHECKING, Any, Optional
+
+import torch
+import torch_npu


import torch_npu at module top level will crash on any non-NPU machine. Since __init__.py imports DiffusionInt8Config unconditionally, this breaks the entire vllm_omni.diffusion.quantization package — including FP8 codepaths.

Move this to a lazy import inside the methods that actually call torch_npu.* (e.g. apply(), process_weights_after_loading()).

Following your suggestion, I have switched to a lazy import.

Looks good, thanks.
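
The lazy-import pattern discussed in this thread can be sketched as follows. This is a generic illustration, not the code merged in this PR: the class name `LazyBackendLinearMethod` is hypothetical, and the standard-library `math` module stands in for `torch_npu` so the example runs on any machine.

```python
import importlib
from types import ModuleType
from typing import Optional


class LazyBackendLinearMethod:
    """Defers importing a device-specific module until a method needs it."""

    def __init__(self, backend_name: str = "torch_npu") -> None:
        # Only the name is stored here, so constructing (or importing) this
        # class never fails on machines without the backend installed.
        self._backend_name = backend_name
        self._backend: Optional[ModuleType] = None

    def _backend_module(self) -> ModuleType:
        if self._backend is None:
            try:
                self._backend = importlib.import_module(self._backend_name)
            except ImportError as exc:
                raise RuntimeError(
                    f"{self._backend_name} is required for this code path"
                ) from exc
        return self._backend

    def apply(self, x: float) -> float:
        # Stand-in for a call such as torch_npu.<op>(...); sqrt is used only
        # because the demo backend below is the math module.
        return self._backend_module().sqrt(x)


# Works: the stand-in backend exists everywhere.
print(LazyBackendLinearMethod("math").apply(9.0))

# Construction succeeds even when the backend is missing; only apply() fails.
missing = LazyBackendLinearMethod("no_such_backend_module")
try:
    missing.apply(1.0)
except RuntimeError as err:
    print(f"deferred failure: {err}")
```

With this layout, a package `__init__.py` can import the class unconditionally without breaking unrelated code paths on non-NPU machines.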

lishunyang12 · 2026-02-25T09:41:35Z

+        replace_parameter(layer, "weight_scale", weight_scale)
+
+
+class DiffusionInt8Config(DiffusionQuantizationConfig):


Missing quant_config_cls = Int8Config — without it get_name() from the base class raises NotImplementedError. See how DiffusionFp8Config sets quant_config_cls = Fp8Config.

Following your suggestion, I have added quant_config_cls = Int8Config.

Confirmed in the diff. Thanks.
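
A minimal sketch of why the class attribute matters, assuming the base class delegates `get_name()` to `quant_config_cls` as the review comment describes. All classes below are simplified, hypothetical stand-ins and not the actual vLLM-Omni definitions.

```python
from typing import Optional, Type


class Int8ConfigStub:
    """Stand-in for the wrapped vLLM-style quantization config."""

    @classmethod
    def get_name(cls) -> str:
        return "int8"


class DiffusionQuantizationConfigSketch:
    """Stand-in base class: subclasses must set quant_config_cls."""

    quant_config_cls: Optional[Type] = None

    @classmethod
    def get_name(cls) -> str:
        if cls.quant_config_cls is None:
            raise NotImplementedError(f"{cls.__name__} must set quant_config_cls")
        return cls.quant_config_cls.get_name()


class BrokenInt8Config(DiffusionQuantizationConfigSketch):
    pass  # forgot quant_config_cls, so get_name() raises


class FixedInt8Config(DiffusionQuantizationConfigSketch):
    quant_config_cls = Int8ConfigStub


print(FixedInt8Config.get_name())
try:
    BrokenInt8Config.get_name()
except NotImplementedError as err:
    print(f"error: {err}")
```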

lishunyang12 · 2026-02-25T09:41:36Z


 logger = logging.getLogger(__name__)

+CONDITION_IMAGE_SIZE = 384 * 384


The CONDITION_IMAGE_SIZE / VAE_IMAGE_SIZE refactor seems unrelated to int8 quantization. Worth splitting into its own PR to keep review scope tight.

This was caused by an incorrect commit and has now been removed from the branch.

lishunyang12 · 2026-02-26T08:23:36Z

Can you post visual output for z-image?

yjb767868009 · 2026-02-27T02:17:36Z

Can you post visual output for z-image?

I will upload the visual output for z-image later.

david6666666 · 2026-03-03T09:08:25Z

@codex review

hsliuustc0106 · 2026-03-03T09:12:27Z

Any test results on GPU?

david6666666 · 2026-03-03T09:10:53Z

+
+## Device Compatibility for Int8
+
+| NPU Generation | Int8 Mode |


The GPU should also support it, right?

This version only supports NPUs; GPU support still needs to be developed. I don't think GPUs need INT8, since FP8 is a better option there, while NPUs currently support only INT8.

Maybe you should add a TODO note about int8 support on GPU.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a117002e76

chatgpt-codex-connector · 2026-03-03T09:14:42Z

+    def from_config(cls, config: dict[str, Any]) -> "Int8Config":
+        quant_method = cls.get_from_keys(config, ["quant_method"])
+        is_checkpoint_int8_serialized = "int8" in quant_method
+        activation_scheme = cls.get_from_keys(config, ["activation_scheme"])


Fall back to dynamic activation scheme in from_config

Int8Config.from_config treats activation_scheme as mandatory via get_from_keys, so any int8 quantization config that omits this field will raise during config parsing instead of using the class default. This breaks loading paths that rely on minimal quantization metadata (e.g., only quant_method present) even though __init__ already defines "dynamic" as the default scheme; switching to an optional lookup with a default avoids this hard failure.
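
The suggested fallback can be sketched as below. The helper functions are self-contained stand-ins written for this example (they mimic a mandatory lookup and an optional lookup with a default), and `parse_int8_config` is a hypothetical name; the real `Int8Config.from_config` may be structured differently.

```python
from typing import Any, Dict, List


def get_from_keys(config: Dict[str, Any], keys: List[str]) -> Any:
    """Mandatory lookup: raise if none of the keys is present."""
    for key in keys:
        if key in config:
            return config[key]
    raise ValueError(f"Cannot find any of {keys} in the quantization config")


def get_from_keys_or(config: Dict[str, Any], keys: List[str], default: Any) -> Any:
    """Optional lookup: fall back to a default instead of raising."""
    try:
        return get_from_keys(config, keys)
    except ValueError:
        return default


def parse_int8_config(config: Dict[str, Any]) -> Dict[str, Any]:
    quant_method = get_from_keys(config, ["quant_method"])
    # Falls back to the same default that __init__ is described as using.
    activation_scheme = get_from_keys_or(config, ["activation_scheme"], "dynamic")
    return {
        "is_checkpoint_int8_serialized": "int8" in quant_method,
        "activation_scheme": activation_scheme,
    }


# Minimal metadata (only quant_method) now parses instead of raising.
print(parse_int8_config({"quant_method": "int8"}))
```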

david6666666 · 2026-03-19T13:34:02Z

Please fix pre-commit

yjb767868009 · 2026-03-19T14:31:10Z

@lishunyang12 @david6666666 @hsliuustc0106 This PR has been prepared and is ready for your review.

…t#1470) Signed-off-by: juboyu <767868009@qq.com> Signed-off-by: JuboYu <767868009@qq.com> Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com> Co-authored-by: Alicia <115451386+congw729@users.noreply.github.com> Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>

…t#1470) Signed-off-by: juboyu <767868009@qq.com> Signed-off-by: JuboYu <767868009@qq.com> Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com> Co-authored-by: Alicia <115451386+congw729@users.noreply.github.com> Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com> Signed-off-by: yiliu30 <yi4.liu@intel.com>

Per PR vllm-project#1470's review template, the "Quantization Showcase" ("量化展示") section requires two tables: the Summary table (already emitted) and a Memory Profiling table that breaks Peak into Weights + Activations.

Capture the extra weights/activations numbers in `_generate_image` and `_generate_video` by snapshotting `memory_allocated()` right before each `generate()` call (= weights + persistent buffers already on device) and subtracting it from the post-generate `max_memory_allocated()` to get the activations delta.

Surface the values in `run_benchmark` as a third markdown table ("### Memory Profiling") with the columns PR vllm-project#1470 used: Weights / Activations / Peak / Total Reduction, broken down by TP size (from `args.tensor_parallel_size`). First-prompt snapshot is canonical, matching the existing Peak column's "use first prompt's memory" convention.

Signed-off-by: ultranationalism <www913363043@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: ultism <www913363043@gmail.com>
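
The measurement described in this commit message can be sketched as follows. The memory counters are injected as callables so the example runs without a GPU; on a CUDA device they would typically be `torch.cuda.memory_allocated` and `torch.cuda.max_memory_allocated`. The function `profile_generate` and the `FakeDevice` class are hypothetical names used only for illustration.

```python
from typing import Callable, Dict

GIB = 1024 ** 3


def profile_generate(
    generate: Callable[[], None],
    memory_allocated: Callable[[], int],
    max_memory_allocated: Callable[[], int],
) -> Dict[str, float]:
    """Split peak memory into weights (pre-generate) and activations (delta)."""
    weights = memory_allocated()   # weights + persistent buffers on device
    generate()
    peak = max_memory_allocated()  # high-water mark reached during generate()
    return {
        "weights_gib": weights / GIB,
        "activations_gib": (peak - weights) / GIB,
        "peak_gib": peak / GIB,
    }


class FakeDevice:
    """Toy allocator that tracks current and peak bytes, for demonstration."""

    def __init__(self, resident_bytes: int) -> None:
        self.current = resident_bytes
        self.peak = resident_bytes

    def alloc(self, n: int) -> None:
        self.current += n
        self.peak = max(self.peak, self.current)

    def free(self, n: int) -> None:
        self.current -= n


device = FakeDevice(resident_bytes=10 * GIB)


def fake_generate() -> None:
    device.alloc(4 * GIB)  # transient activations
    device.free(4 * GIB)


stats = profile_generate(fake_generate, lambda: device.current, lambda: device.peak)
print(stats)
```

On real hardware the peak counter usually needs to be reset before the measured call so earlier allocations do not inflate the activations delta.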

…ponent glue Pivot from the closed PR vllm-project#1986 design (runtime nunchaku-format glue) to the post-RFC architecture where: - vLLM upstream (vllm-project/vllm) hosts the SVDQuant quantization config, linear method, dispatcher, and native SM_100/103 CuTe DSL kernel. - vllm-omni hosts only diffusion-specific glue and a one-time offline converter that emits canonical row-major NVFP4 checkpoints. Components: - `vllm_omni/quantization/tools/convert_nunchaku_to_svdquant.py`: Standalone converter that ingests nunchaku-published merged safetensors and emits a vLLM-loadable diffusers pipeline tree in canonical row-major + FP4 nibble pack. This is the layout the SM_100 native CuTe kernel consumes directly; for the nunchaku backend on consumer GPUs (SM_75-89 / SM_120), vLLM repacks to the PTX-MMA tile layout at load time in `SVDQuantLinearMethod.process_weights_after_loading`. The on-disk format is backend-agnostic. - `vllm_omni/quantization/tools/svdquant_nvfp4_layout.py`: Thin re-export shim. The actual layout adapters live in `vllm/model_executor/layers/quantization/utils/svdquant_nvfp4_layout.py` in vLLM proper; this file keeps the import surface stable for downstream code that referenced the original vllm-omni location. - `vllm_omni/quantization/component_config.py`: per-component quantization config wiring so per-pipeline-component (transformer, text_encoder, vae, etc.) quant config can be declared declaratively in `transformer/config.json["quantization_config"]` rather than runtime monkey-patching. Addresses the "blanket strict-validation disable" review concern from vllm-project#1986. - `vllm_omni/diffusion/models/z_image/z_image_transformer.py`: trailing -dot fix in `stacked_params_mapping` (replaces the per-model diffusers->vLLM key remapping from the closed PR; reduced to one line under the canonical row-major design). 
- `examples/offline_inference/text_to_image/text_to_image.py`: smarter quantization label resolution that mirrors `OmniDiffusionConfig._propagate_quantization_from_tf_config` so the startup banner reflects on-disk per-component quant config rather than printing "None (BF16)" for an already-quantized checkpoint. Canonical checkpoint produced by this converter: - HuggingFace: https://huggingface.co/ultranationalism/nunchaku-z-image-turbo-svdq - ModelScope: https://www.modelscope.cn/models/ultranationalism/Z-Image-Turbo-SVDQuant-NVFP4 Test plan (pending validation on consumer Blackwell SM_120): - E2E quantized Z-Image-Turbo inference on RTX 5090 - BF16 vs SVDQuant LPIPS quality benchmark per PR vllm-project#1470 template Refs: vllm-project#1986 (closed), RFC vllm-project/vllm#37908 AI assistance: this commit was produced with Claude Code assistance. Signed-off-by: ultranationalism <www913363043@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: ultism <www913363043@gmail.com>

yjb767868009 requested a review from hsliuustc0106 as a code owner February 25, 2026 07:44

chatgpt-codex-connector Bot reviewed Feb 25, 2026

yjb767868009 closed this Feb 25, 2026

yjb767868009 deleted the int8-quant branch February 25, 2026 07:58

yjb767868009 reopened this Feb 25, 2026

lishunyang12 mentioned this pull request Feb 25, 2026

[RFC] Q1 Quantization Support #1057

Closed

lishunyang12 reviewed Feb 25, 2026

lishunyang12 mentioned this pull request Feb 25, 2026

[RFC] Quantization + CPU offload: _process_weights_after_loading needs device-aware quantization #1474

Closed

yjb767868009 added 7 commits February 26, 2026 10:03

[Feature] Add Int8 quantization support for Z-Image and Qwen-Image

61dee65

Signed-off-by: juboyu <767868009@qq.com>

fix process_weights_after_loading

aa3598e

Signed-off-by: juboyu <767868009@qq.com>

fix DiffusionInt8Config init

e394819

Signed-off-by: juboyu <767868009@qq.com>

fix Int8Config's function from_config undefine Int8Config

f4282df

Signed-off-by: juboyu <767868009@qq.com>

fix format

b1c29a4

Signed-off-by: juboyu <767868009@qq.com>

fix format

2a95e10

Signed-off-by: juboyu <767868009@qq.com>

fix format

a3fcc33

Signed-off-by: juboyu <767868009@qq.com>

yjb767868009 force-pushed the int8-quant branch from 88f7f80 to a3fcc33 Compare February 26, 2026 02:05

yjb767868009 added 2 commits February 26, 2026 11:06

add quant_config_cls and fix import torch_npu

17aab52

Signed-off-by: juboyu <767868009@qq.com>

fix format

9ecc62f

Signed-off-by: juboyu <767868009@qq.com>

lishunyang12 mentioned this pull request Mar 1, 2026

[Benchmark] Add quantization quality benchmark script (LPIPS) #1575

Closed

4 tasks

yjb767868009 added 3 commits March 3, 2026 14:16

Merge branch 'main' into int8-quant

b8cb90d

Signed-off-by: JuboYu <767868009@qq.com>

fix invalid character

6173d12

Signed-off-by: juboyu <767868009@qq.com>

Merge branch 'main' into int8-quant

a117002

david6666666 reviewed Mar 3, 2026

chatgpt-codex-connector Bot reviewed Mar 3, 2026

yjb767868009 added 2 commits March 19, 2026 19:57

Fix the issue of quantization parameter passing, and add z_image as the prefix for quantization ignored_layers

084db20

Signed-off-by: juboyu <767868009@qq.com>

Merge branch 'main' into int8-quant

9975e1e

Signed-off-by: JuboYu <767868009@qq.com>

fix format

e1cfe8c

Signed-off-by: juboyu <767868009@qq.com>

david6666666 approved these changes Mar 19, 2026

david6666666 enabled auto-merge (squash) March 19, 2026 14:47

hsliuustc0106 disabled auto-merge March 19, 2026 14:57

hsliuustc0106 merged commit b766c47 into vllm-project:main Mar 19, 2026
7 checks passed

lishunyang12 mentioned this pull request Mar 20, 2026

[FP8] enable hunyuan-image-3 diffusion model with fp8 online quant #1935

Merged

5 tasks

This was referenced Apr 4, 2026

[Feature] Add quantization support for NextStep-1.1 model #2482

Open

[Feat] support quantization for GLM-IMAGE #2292

Merged

[Feat][HunyuanVideo-1.5]Support vae-patch-parallel #2418

Open

lyj-jjj mentioned this pull request Apr 8, 2026

[RFC]: support DiT MM online FP8 quantization on NPU #2592

Open

1 task

This was referenced Apr 8, 2026

[Quantization] FP8 quantization framework for diffusion attention #1413

Closed

[RFC]: Continuous Quantization Support for NPU #2438

Open

lishunyang12 mentioned this pull request Apr 23, 2026

[NPU] [Quant] Support HunyuanImage3 offline quantization with vLLM-Ascend on diffusion path #2979

Merged

8 tasks

ultism mentioned this pull request May 23, 2026

[Diffusion][Quantization] SVDQuant W4A4 (Nunchaku) for Z-Image-Turbo #3830

Open
